Practical: outbreak reconstruction

Introduction

On the tidyverse

The tidyverse is a collection of R packages designed for data science. Developed with high standards of coding practices and software development, these packages form a consistent ecosystem to handle and visualise data. In this practical, we will mostly rely on the following packages:

  • dplyr for handling data, making new variables, building summaries, number-crunching

  • tidyr to re-arrange tidy/data

  • ggplot2 to visualise data

  • magrittr for the piping operator (%>%)

Most of these functionalities are summarised in handy cheatsheets. We provide links to the most relevant ones below:


Other packages

Other packages we will use include:

  • rmarkdown for automated report generation

  • rio to read .xlsx files in

  • ape for phylogenetic reconstruction

Cheatsheets and other resources:

  • the rmarkdown cheatsheet

  • the knitr website documenting many options used in rmarkdown’s settings and code chunks


On the RECONverse

Several RECON packages will be used in the practical:

  • linelist for data cleaning

  • incidence older package to build and visualise epidemic curves, and do basic model fitting; it is being replaced by incidence2 but some older packages like earlyR (for estimating the reproduction number) and projections (for short term forecasting) are still using it.

  • epicontacts to visualise and analyse contact tracing data or transmission chains

  • epitrix to estimate delay distributions

  • distcrete to handle delay distributions

  • EpiEstim for estimation of the reproduction number over time

  • outbreaker2 for outbreak reconstruction using multiple data sources

Note that earlyR and projections will soon be adapted to handle incidence2 objects.

For some of these packages, we recommend using the github (development) version, either because it is more up-to-date, or because they have not been released on CRAN yet.


What we will do today

This practical will follow up on the response to the fictitious Ebola outbreak in Morporkia. It will build and consolidate material seen in previous sessions, and will show how genetic data can be used to reconstruct transmission trees (who infects whom?), especially when combined with epidemiological data. We will particularly look at:

  • cleaning / standardising ‘dirty’ data
  • building epidemic curves
  • estimate serial interval distributions of contact data
  • build new distributions using parameters reported in the literature
  • visualise and analyse transmission chains / contact tracing data
  • estimate growth rates from epidemic curves
  • estimate time varying reproduction number
  • read in genetic data
  • make a simple phylogenetic tree
  • combine epidemiological and genetic data to reconstruct transmission trees

In addition, we strongly recommend storing all analyses into one or several rmarkdown reports.


How this document works

This practical will take you through the reconstruction of a fictitious Ebola outbreak using various sources of data. It is pitched at an intermediate level, so that a fair fraction of the code is to be generated by the student. Wherever this is the case, code seen in previous practicals should come in handy. In all instances, try to first answer your own questions: try/error is a great way to learn R. But if you feel stuck, do not hesitate to questions to the demonstrators: they are here to help.

It is important that you understand what the code does. Look at relevant documentation, try tweaking the code, experiment, and do not hesitate to ask questions. It is not essential to go through all examples to illustrate important points, so take your time, especially if you are new to R. Also note that the solutions provided are by not means the only options for solving a given problem - feel free to explore alternatives and discuss them with the trainers.




An update on the EVD outbreak in Ankh, Republic of Morporkia

This practical is the second part of the response to a simulated Ebola Virus Disease (EVD) outbreak taking place in the city of Ankh, Republic of Morporkia. While the first part focussed on early assessments of transmissibility, this part explores more methodological options for estimating transmissibility, and provides an introduction to outbreak reconstruction using outbreaker2.

After some rather concerning on the new EVD outbreak in the city of Ankh, Republic of Morporkia, Public Health Morporkia (PHM) has sent you updated linelists and contact data. This time, PHM has also obtained Whole Genome Sequences (WGS) of the Ebola virus isolated in patients. As before, you are asked to assess the situation and produce evidence-based recommendations for informing the response.


The new data

The data update includes new linelists and contact lists:

  • PHM-EVD-linelist-2017-12-01.xlsx: a linelist containing case information up to the 1st December 2017

  • PHM-EVD-contacts-2017-12-01.xlsx: a list of contacts reported between cases up to the 1st December 2017, where from indicates a potential source of infection, and to the recipient of the contact


Exercise: Re-using code you used previously, import and clean the data. We recommend the following steps:

  1. create a data/ folder in your project directory

  2. download and save the data files in data/

  3. import the files using here::here to find the path to the data, and rio::import to read the files and import them into R; save the linelist as an object called linelist_raw, and contacts as contacts_raw

  4. import the cleaning rules from the corresponding excel file and save the object as cleaning_rules

  5. use linelist’s function to clean the data, using guess_dates on the relevant columns to make sure dates are useable, and the cleaning rules defined in cleaning_rules; save the clean (tibble) data as linelist and contacts


Once imported and cleaned, the data should look like:



Analysis of contacts and transmissibility


Analysis of contact data

After the initial stage of the outbreak, contact tracing has been maintained but started being sparser, and exposures haven’t been reported for all cases. Despite these limitations, contacts are still a valuable source of information. Using the function make_epicontacts in the epicontacts package, create a new epicontacts object called x, specifying that contacts are directed. When plotting the data, use the arguments node_shape and shapes (see ?vis_epicontacts) to distinguish males and females. Also, use node_color to distinguish individuals from Pseudopolis and Ankh Morpork. For a list of available symbols and corresponding shape code, you can type epicontacts::codeawesome.

The results should look like:

What can you say about these contacts? How would you interpret the different clusters?


Question / exercise:

  • What can you say about these contacts? How would you interpret the different clusters?

  • A colleague, epidemiologist at PHM, regularly states during meetings that Pseudopolis is the main ‘source’ of new infections (‘seeds’) into Ankh Morpork. Test this hypothesis (see ?get_pairwise). What do you think?


Explanation:

We build a contingency table using the locations of the infectors and the infectees using get_pairwise, and test whether there two variables are randomly associated using a Chi-square test. The non-parametric version of the test is used because of the low counts, which could bias the parametric version. Results show no association: the colleague from PHM is wrong in his statement that infections in Ankh Morpork mostly come from Pseudopolis.


Estimating growth rates from epidemic curves


Exercise:

Using approaches seen in previous practicals, build epidemic curves, and estimate the growth rate for the two cities separately, using the entire time series in each case. What results do you find? Can these results be trusted? Why?


Tips on estimating growth rates

There are two options for estimating the growth rates as requested:

  • the old way: use the incidence package to build the epidemic curves, and the function fit to estimate the growth rates using a log-linear regression, with the caveat that days with 0 incidence are ignored

  • the new way: use incidence2 to build the epidemic curves, and the function fit_curve from the i2extras package, specifying a Poisson model; you can then merely plot the results using plot, and derive corresponding growth rates and doubling/halving times using growth_rate

Explanation:

Results are not very conclusive, whether the ‘old’ or the ‘new’ approaches are used. The ‘old’ approach uses a log-linear model which ignores days of zero incidence, so that results can effectively be biased. The Poisson model should be a better option, but the fit as shown in the plot is very poor. Quite possibly, this is because there has not been a single temporal trend, and transmissibility may have changed since the beginning of the outbreak. For instance, the apparent growth in Ankh Morpork disappears when considering only data from the 1st November 2017 onward. A good complementary analysis will be to estimate time-varying reproduction numbers using EpiEstim.


Estimating time-varying reproduction numbers

When the assumption that \(R\) is constant over time becomes untenable, an alternative is the estimationg of time-varying transmissibility using the instantaneous reproduction number \(R_t\). This approach, introduced by Cori et al. (Cori et al. 2013), is implemented in the package EpiEstim (function estimate_R). It estimates \(R_t\) for a succession of sliding time windows, using the same Poisson branching process model as in earlyR. In the following, we:

  • estimate the serial interval by fitting a discretized Gamma distribution to the empirical data derived from the transmission chains

  • build an incidence object to serve as input to estimate_R; note that new incidence2 objects will soon be possible inputs for estimate_R as well

  • use estimate_R to estimate the time-varying reproduction number


Question:

What do you make of these results? What do you think is the probable course of the outbreak if nothing changes?




Finding who infected whom

To gain a better understanding of the transmission process, we can attempt to reconstruct plausible transmission trees using the dates of symptom onsets and limited contact data. This can be achieved using outbreaker2 (Campbell, Didelot, et al. 2018), which provides a modular platform for outbreak reconstruction. This package extends and replaces outbreaker (Jombart et al. 2014), which in contrast was a static implementation of a specific transmission model (Jombart et al. 2014).


Looking at Whole Genome Sequences (WGS)

WGS have been obtained for all cases in this outbreak. They are stored as a the fasta file phm_evd_wgs.fa). Download this file, save it in the data/ folder, and then run the code below to import the data in R:

As a first exploration of the data, we derive a Neighbour-Joining tree rooted at the first case of the outbreak:

This phylogenetic tree shows the inferred evolution of the pathogen sequences. Branch length (x-axis) correspond to the number of mutations occuring between lineages (indicated by the axis at the bottom). The tree has been rooted to the index case, so that this sequence (top, left) is the “most ancient” part of the tree. Note that in such representations, distances on the y-axis are meaningless.


Question:

How would you interpret this phylogenetic tree? Many methods of outbreak reconstruction infer transmission events from phylogenies. What results would you expect here?



Building delay distributions

outbreaker2 can handle different types of dates. When dates of onset are provided, information on the generation time (delay between primary and secondary infections) and on the incubation period (delay between infection and symptom onset) can be included in the model. These delays are typically modelled as Gamma distributions, which need to be discretised in order to account for the fact that time is reported as days.

Here, we will use the serial interval estimated from the transmission pairs as a proxy for the generation time. For the incubation period, your colleagues from PHM report that they observed average incubation times of 10.6 days with a standard deviation of 7.4 days. To build a discretized Gamma distribution from this data, we will:

  • use epitrix’s function gamma_mucv2shapescale to convert these parameters into shape and scale for a Gamma distribution

  • then use distcrete to generate discretised distributions using shape and scale

Similarly, we can visualise the serial interval distribution previously estimated from the transmission chains:


Using the original outbreaker model

The original outbreaker model combined temporal information (here, dates of onset) with sequence data to infer who infected whom. Here, we use outbreaker2 to apply this model to the data.

All inputs to the new outbreaker function are prepared using dedicated functions, which make a number of checks on provided inputs and define defaults:

We also create a configuration, which determines different aspects of the analysis, including which parameters need to be estimated, initial values of parameters, the length of the MCMC, etc.:

We can now run the analysis. This should take a couple of minutes on modern laptops. Note the use of set.seed(0) to have identical results across different users and computers:


Exercise:

The first two plots show the trace of the log-posterior densities (with, and without burnin). Look at other possible plots documented in ?plot.outbreaker_chains, and reproduce the graphs below to show results on:

  • ancestries
  • infection dates
  • mutation rate
  • transmission tree


As a further help for interpretation, you can derive a consensus tree from the posterior samples of trees using summary. Look in particular at the support column, and compare the results to the contact data.


Exercise: How would you interpret the results? Is this what you would have expected? As a point of comparison, repeat the same analysis using temporal data only, and plot a graph of ancestries (type = "alpha"); you should obtain something along the lines of:



Question: What is the usefulness of temporal and genetic data for outbreak reconstruction? Note that this topic has been discussed before (Campbell, Strang, et al. 2018). What other data would you ideally include?



Adding contact data to the reconstruction process

outbreaker2 also implements a module for taking contact data into account when reconstructing transmission chains (Campbell et al. 2019). Contact data currently contains case labels. While epicontacts objects will soon be accepted as inputs in outbreaker2, for now we need to operate some minor transformations to define contacts using cases indices rather than labels:

All inputs to the outbreaker function are prepared using dedicated functions, which make a number of checks on provided inputs and define defaults:

We are now ready to run the analysis. This may take a couple of minutes, depending on your computer:

Produce graphics as in the previous model. Assess convergence, choose an appropriate burnin, visualise ancestries and the infection timelines:


Question: How would you interpret the results? Derive a consensus tree using summary, and make a new epicontacts object, using the previous linelist, to visualise the consensus tree with meta-information:


In the following, we add age class information to the linelist of cons_tree, and create color palettes which will be used to display information on the final graph:

Looking carefully at the documentation of vis_epicontacts, try to reproduce the final consensus tree below:


Question: What are your conclusions? What are the main drivers of this outbreak? What recommendations would you make to further improve the response?


Explanation:

When adding contact data, the transmission tree becomes much clearer, with most ancestries being well-resolved and pointing to a single, highly probable infector. We can then visualise a consensus tree (the $tree component of the list output when calling summary() on outbreaker2 objects), i.e. the tree in which only the most likely infector is kept for each case. When we add information about the age groups to the consensus tree (using colors for the nodes), it seems all ‘super-spreaders’ were from the older age group. Further investigations should be made to understand this trend. Meanwhile, it may be a good idea to prioritise contact tracing followup from the older patients. In particular, case 48, a recent case who has at least infected one person, and may have caused more secondary transmissions.



References

Campbell, Finlay, Anne Cori, Neil Ferguson, and Thibaut Jombart. 2019. “Bayesian Inference of Transmission Chains Using Timing of Symptoms, Pathogen Genomes and Contact Data.” PLoS Comput. Biol. 15 (3): e1006930.

Campbell, Finlay, Xavier Didelot, Rich Fitzjohn, Neil Ferguson, Anne Cori, and Thibaut Jombart. 2018. “Outbreaker2: A Modular Platform for Outbreak Reconstruction.” BMC Bioinformatics 19 (Suppl 11): 363.

Campbell, Finlay, Camilla Strang, Neil Ferguson, Anne Cori, and Thibaut Jombart. 2018. “When Are Pathogen Genome Sequences Informative of Transmission Events?” PLoS Pathog. 14 (2): e1006885.

Cori, Anne, Neil M Ferguson, Christophe Fraser, and Simon Cauchemez. 2013. “A New Framework and Software to Estimate Time-Varying Reproduction Numbers During Epidemics.” Am. J. Epidemiol. 178 (9): 1505–12.

Jombart, Thibaut, Anne Cori, Xavier Didelot, Simon Cauchemez, Christophe Fraser, and Neil Ferguson. 2014. “Bayesian Reconstruction of Disease Outbreaks by Combining Epidemiologic and Genomic Data.” PLoS Comput. Biol. 10 (1): e1003457.